{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a84528f3",
   "metadata": {},
   "source": [
    "## LangChain in Practice (1)\n",
    "Contents:\n",
    "1. LangChain overview\n",
    "2. Prompt templates\n",
    "3. Models and output parsers"
   ]
  },
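  {
   "cell_type": "markdown",
   "id": "d74d5135",
   "metadata": {},
   "source": [
    "### 1. What is LangChain, and why do we need it?\n",
    "- Question: what would happen without LangChain?\n",
    "- Answer: every piece listed below would have to be wired together by hand.\n",
    "\n",
    "A single project may involve:\n",
    "- Calling several different large models (GPT-4, video generation, ...)\n",
    "- A vector database\n",
    "- Handling different data types (loading, splitting into chunks, ...)"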
   ]
  },
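  {
   "cell_type": "markdown",
   "id": "278e3c5b",
   "metadata": {},
   "source": [
    "- LangChain is a framework for developing applications on top of large language models.\n",
    "- LangChain evolves quickly. This course was taught against version 0.1.7, and the syntax and interfaces may change at any time, so keep an eye on the official documentation."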
   ]
  },
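  {
   "cell_type": "markdown",
   "id": "e3e1f79b",
   "metadata": {},
   "source": [
    "#### LangChain core components\n",
    "- ```Model I/O```: wraps large language models (LLMs), chat models, prompt templates, output parsers, and so on\n",
    "- ```Retrieval```: document loaders, embedding models, text splitters, vector stores, retrievers, and so on\n",
    "- ```Chain```: implements a single function or a series of functions run sequentially\n",
    "- ```Agent```: given the user's input and a set of available tools, automatically plans the execution steps (for example, which tools to call at each step) and carries out the user's instruction\n",
    "- ```Memory```: manages the model's memory"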
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46f1ae3b",
   "metadata": {},
   "source": [
    "#### Plan for the LangChain sessions\n",
    "1. LangChain (1) - LangChain overview, model I/O\n",
    "2. LangChain (2) - Retrieval, Chain, Agent, and Memory components\n",
    "3. LangChain (3) - Advanced RAG + LangChain\n",
    "4. LangChain (4) - Agents\n",
    "5. LangChain (5) - Walkthrough of classic open-source Agent projects\n",
    "6. LangChain (6) - Classic Agent case studies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "1390a863",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the required libraries; by default we install the latest versions\n",
    "#!pip install langchain\n",
    "#!pip install openai\n",
    "#!pip install langchain-openai\n",
    "\n",
    "# After installing, you can check the version number\n",
    "# import openai\n",
    "# print (openai.__version__)\n",
    "# !python -m pip install python-dotenv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "cff759dd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the OpenAI API key\n",
    "import os\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "\n",
    "# The API key is stored in a .env file\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad65e2cc",
   "metadata": {},
   "source": [
    "### 2. A quick overview of LangChain\n",
    "Here we take a quick tour of LangChain's components. Make sure the required libraries are installed first."
   ]
  },
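  {
   "cell_type": "markdown",
   "id": "e6f62b5c",
   "metadata": {},
   "source": [
    "Two kinds of models:\n",
    "- ```non-chat model```: for text completion; given the beginning of a text, it completes the rest\n",
    "- ```chat model```: for chat; a conversational model that can carry on a fluent dialogue\n",
    "\n",
    "We focus mainly on chat models."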
   ]
  },
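  {
   "cell_type": "markdown",
   "id": "c4a7e101",
   "metadata": {},
   "source": [
    "The two model types differ mainly in the shape of their inputs and outputs. The sketch below is a toy illustration under stated assumptions, not LangChain or OpenAI code: `fake_completion_model` and `fake_chat_model` are hypothetical stubs, and the role-tagged dictionaries only mimic the common system/user/assistant message convention.\n",
    "\n",
    "```python\n",
    "# Hypothetical stubs standing in for real models (illustration only)\n",
    "def fake_completion_model(text):\n",
    "    # A non-chat model maps one string to a continuation of that string\n",
    "    return text + ' ... (continuation)'\n",
    "\n",
    "def fake_chat_model(messages):\n",
    "    # A chat model maps a list of role-tagged messages to one assistant message\n",
    "    last_user = [m['content'] for m in messages if m['role'] == 'user'][-1]\n",
    "    return {'role': 'assistant', 'content': '(reply to: ' + last_user + ')'}\n",
    "\n",
    "completion = fake_completion_model('Once upon a time')\n",
    "reply = fake_chat_model([\n",
    "    {'role': 'system', 'content': 'You are a technical writer'},\n",
    "    {'role': 'user', 'content': 'What is the Sora model?'},\n",
    "])\n",
    "print(completion)\n",
    "print(reply)\n",
    "```\n",
    "\n",
    "Because a chat model consumes role-tagged messages rather than a bare string, prompts for it are normally built with explicit roles."
   ]
  },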
  {
   "cell_type": "markdown",
   "id": "5ae97019",
   "metadata": {},
   "source": [
    "```A. Calling a model```\n",
    "\n",
    "LangChain already wraps a wide range of models, both open-source and closed-source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "82059e28",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI()\n",
    "#llm = ChatOpenAI(model_name=\"gpt-4\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "d1345960",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'gpt-3.5-turbo'"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "llm.model_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "c8303580",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='The Sora model is a framework developed by Richard Culatta that focuses on personalized learning and the integration of technology in education. It stands for Social learning, Ownership of learning, Reflection, and Authentic learning experiences. This model emphasizes the importance of students taking ownership of their learning, engaging in social interactions with peers and experts, reflecting on their learning experiences, and participating in authentic, real-world tasks. The Sora model aims to create a more student-centered and engaging learning environment.')"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Pass the question directly and call the LLM\n",
    "llm.invoke(\"What is the Sora model?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9199def",
   "metadata": {},
   "source": [
    "```B. Using a prompt template```\n",
    "\n",
    "A prompt can contain variables, which makes building prompts more flexible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "b1833770",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can also create a prompt template and put variables in it, which makes it more flexible to use\n",
    "\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "# Note that each message needs an explicit role; here the roles are system and user\n",
    "prompt = ChatPromptTemplate.from_messages([\n",
    "    (\"system\", \"You are the technical writer\"),\n",
    "    (\"user\", \"{input}\")  # {input} is a variable\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "29e97f64",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='The SORA model is a structured approach used in the risk assessment of cybersecurity threats and vulnerabilities. SORA stands for \"Secure-ly Orchestrated, Reliable, and Adaptive.\" It is a methodology developed by the European Union Aviation Safety Agency (EASA) for assessing the safety risks associated with drone operations in the context of U-space (unmanned aircraft systems airspace management). \\n\\nThe SORA model consists of several steps, including defining the operational context, identifying potential hazards, assessing risks, and defining mitigation measures. It aims to ensure the safe integration of drones into airspace by evaluating the risks and implementing appropriate safety measures.\\n\\nOverall, the SORA model provides a systematic framework for assessing cybersecurity risks in drone operations and developing strategies to mitigate these risks effectively.')"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We can combine the prompt and the LLM call into a chain (a chain can be thought of as a sequence of calls to make)\n",
    "chain = prompt | llm \n",
    "chain.invoke({\"input\": \"What is the Sora model?\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "dc256018",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The SORA model is a structured approach used in the risk assessment of cybersecurity threats and vulnerabilities. SORA stands for \"Secure-ly Orchestrated, Reliable, and Adaptive.\" It is a methodology developed by the European Union Aviation Safety Agency (EASA) for assessing the safety risks associated with drone operations in the context of U-space (unmanned aircraft systems airspace management). \\n\\nThe SORA model consists of several steps, including defining the operational context, identifying potential hazards, assessing risks, and defining mitigation measures. It aims to ensure the safe integration of drones into airspace by evaluating the risks and implementing appropriate safety measures.\\n\\nOverall, the SORA model provides a systematic framework for assessing cybersecurity risks in drone operations and developing strategies to mitigate these risks effectively.'"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "\n",
    "output_parser = StrOutputParser()  # returns the output as a plain string\n",
    "chain = prompt | llm | output_parser\n",
    "chain.invoke({\"input\": \"What is the Sora model?\"})"
   ]
  },
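  {
   "cell_type": "markdown",
   "id": "12e1aab3",
   "metadata": {},
   "source": [
    "```Question```: the model does not really understand Sora. Why, and how can we fix it?\n",
    "\n",
    "Sora was announced after the model's training data was collected, so the model has no knowledge of it. Use RAG: fetch up-to-date content about Sora from the web."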
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c8aa074",
   "metadata": {},
   "source": [
    "```C. RAG + LangChain```\n",
    "\n",
    "Augment the model's answers with external knowledge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "f44fff33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install beautifulsoup4"
   ]
  },
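  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "d7eda827",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the Sora technical report to generate a better answer, in three steps:\n",
    "# Step 1: find source material about Sora and fetch its content\n",
    "# Step 2: split the material into chunks and store them in a vector database\n",
    "# Step 3: for a new question, first retrieve chunks from the vector store, then merge them into the LLM prompt\n",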
    "\n",
    "from langchain_community.document_loaders import WebBaseLoader\n",
    "loader = WebBaseLoader(\"https://openai.com/research/video-generation-models-as-world-simulators\")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "31d9f02b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use OpenAI embeddings\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "2e61ebc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install faiss-cpu"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "68b461a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "# Use RecursiveCharacterTextSplitter; its algorithm was covered in the session before the Spring Festival\n",
    "text_splitter = RecursiveCharacterTextSplitter()\n",
    "\n",
    "# Split docs into chunks; there is only one doc here because we fetched a single page\n",
    "documents = text_splitter.split_documents(docs)\n",
    "\n",
    "# Store the chunks in the vector database; OpenAIEmbeddings turns each chunk into a vector\n",
    "vector = FAISS.from_documents(documents, embeddings)"
   ]
  },
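  {
   "cell_type": "markdown",
   "id": "f3e9a3ce",
   "metadata": {},
   "source": [
    "1. Given an input, search the vector database for similar documents (chunks)\n",
    "2. Add the documents to the prompt (through a prompt template variable such as {context})\n",
    "3. Call the LLM with the prompt; the LLM returns a response (the answer)\n",
    "4. Pass the response through an output parser to get the formatted result"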
   ]
  },
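  {
   "cell_type": "markdown",
   "id": "c4a7e102",
   "metadata": {},
   "source": [
    "The four steps above can be sketched in plain Python without any external service. This is a toy illustration under stated assumptions, not how FAISS or OpenAIEmbeddings work internally: `embed` is a bag-of-words stand-in for a learned embedding model, `fake_llm` is a stub for the model call, and all function names are hypothetical.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "def embed(text):\n",
    "    # Toy embedding: word counts (a real system uses a learned embedding model)\n",
    "    return Counter(text.lower().split())\n",
    "\n",
    "def cosine(a, b):\n",
    "    dot = sum(a[w] * b[w] for w in a)\n",
    "    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))\n",
    "    return dot / norm if norm else 0.0\n",
    "\n",
    "def retrieve(question, chunks, k=1):\n",
    "    # Step 1: rank the chunks by similarity to the question and keep the top k\n",
    "    q = embed(question)\n",
    "    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]\n",
    "\n",
    "def build_prompt(question, context_chunks):\n",
    "    # Step 2: fill the {context} and {input} variables of the template\n",
    "    template = 'Answer using only this context: {context} Question: {input}'\n",
    "    return template.format(context=' '.join(context_chunks), input=question)\n",
    "\n",
    "def fake_llm(prompt):\n",
    "    # Step 3: stub for the LLM call; returns a message-like dictionary\n",
    "    return {'content': 'stub answer for a prompt of ' + str(len(prompt)) + ' characters'}\n",
    "\n",
    "def parse(message):\n",
    "    # Step 4: output parser; here it only extracts the text\n",
    "    return message['content']\n",
    "\n",
    "chunks = [\n",
    "    'Sora is a video generation model from OpenAI',\n",
    "    'FAISS is a library for vector similarity search',\n",
    "]\n",
    "question = 'What is the Sora model'\n",
    "top = retrieve(question, chunks)\n",
    "prompt = build_prompt(question, top)\n",
    "answer = parse(fake_llm(prompt))\n",
    "print(top)\n",
    "print(answer)\n",
    "```\n",
    "\n",
    "In the real pipeline each stub is replaced by a library component: the vector store handles step 1, the prompt template step 2, the chat model step 3, and the output parser step 4."
   ]
  },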
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "0dd64403",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This chain takes a list of documents and formats them all into a prompt, then passes that prompt to an LLM.\n",
    "from langchain.chains.combine_documents import create_stuff_documents_chain\n",
    "\n",
    "prompt = ChatPromptTemplate.from_template(\"\"\"Answer the following question based only on the provided context:\n",
    "\n",
    "<context>\n",
    "{context}\n",
    "</context>\n",
    "\n",
    "Question: {input}\"\"\")\n",
    "\n",
    "document_chain = create_stuff_documents_chain(llm, prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "d6acb291",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains import create_retrieval_chain\n",
    "\n",
    "retriever = vector.as_retriever()\n",
    "retrieval_chain = create_retrieval_chain(retriever, document_chain)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "7bf52582",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Sora model is a video generation model that can simulate actions, interact with the world, and generate images and videos based on various prompts and inputs. It is capable of maintaining both short- and long-range dependencies, simulating artificial processes like video games, and generating high-fidelity videos and images of variable durations, resolutions, and aspect ratios.\n"
     ]
    }
   ],
   "source": [
    "response = retrieval_chain.invoke({\"input\": \"What is the Sora model?\"})\n",
    "print(response[\"answer\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "659bcc3d",
   "metadata": {},
   "source": [
    "```D. Agent```"
   ]
  },
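  {
   "cell_type": "markdown",
   "id": "12423478",
   "metadata": {},
   "source": [
    "Non-agent: for a given task we spell out steps 1, 2, 3, 4 explicitly. Every step is clear and fixed in advance, including which model to call and how to call it.\n",
    "\n",
    "Agent: for more complex tasks, where the steps are not fixed in advance\n",
    "\n",
    "Analogy: a team building an app\n",
    "\n",
    "The project lead breaks the work into tasks and assigns each task to a person in a suitable role.\n",
    "\n",
    "Suppose a set of tools is available in advance:\n",
    "- a video editing tool\n",
    "- a video-to-animation tool\n",
    "- an image generation tool\n",
    "- an animated video generation tool\n",
    "- a TTS tool\n",
    "- a GPT tool (input, output)\n",
    "- a calculator tool (input, output): use it for arithmetic such as addition, subtraction, multiplication, and division\n",
    "- a coding tool (input, output)\n",
    "- a storyboard tool that splits a script into shots (input, output)\n",
    "- a tool that turns a list of images into a video\n",
    "\n",
    "\n",
    "Task: automatically produce a short animated video\n",
    "1. GPT tool: generate the script (input, output)\n",
    "2. Storyboard: split the long script into separate shots\n",
    "3. Generate an image for each shot: image generation tool\n",
    "4. Convert the images into a video\n",
    "\n"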
   ]
  },
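  {
   "cell_type": "markdown",
   "id": "c4a7e103",
   "metadata": {},
   "source": [
    "The short-video task above can be sketched as a tool registry plus a planner. This is a toy illustration under stated assumptions: every tool is a hypothetical stub, and `fake_planner` returns a hard-coded plan where a real agent would ask the LLM which tool to call next, usually one step at a time after seeing the previous result.\n",
    "\n",
    "```python\n",
    "# Hypothetical tools: each one is a plain function from input to output\n",
    "def script_tool(topic):\n",
    "    return 'script about ' + topic\n",
    "\n",
    "def storyboard_tool(script):\n",
    "    return [script + ' / shot ' + str(i) for i in (1, 2)]\n",
    "\n",
    "def image_tool(shots):\n",
    "    return ['image(' + s + ')' for s in shots]\n",
    "\n",
    "def video_tool(images):\n",
    "    return 'video of ' + str(len(images)) + ' images'\n",
    "\n",
    "TOOLS = {\n",
    "    'script': script_tool,\n",
    "    'storyboard': storyboard_tool,\n",
    "    'image': image_tool,\n",
    "    'video': video_tool,\n",
    "}\n",
    "\n",
    "def fake_planner(task):\n",
    "    # Stand-in for the LLM: returns the names of the tools to call, in order\n",
    "    return ['script', 'storyboard', 'image', 'video']\n",
    "\n",
    "def run_agent(task, topic):\n",
    "    state = topic\n",
    "    trace = []\n",
    "    for name in fake_planner(task):\n",
    "        state = TOOLS[name](state)  # feed the previous result into the chosen tool\n",
    "        trace.append(name)\n",
    "    return state, trace\n",
    "\n",
    "result, trace = run_agent('make a short animated video', 'a cat astronaut')\n",
    "print(trace)\n",
    "print(result)\n",
    "```\n",
    "\n",
    "The difference from a non-agent pipeline lies in who decides the order of calls: here the planner does, and the code only knows how to look up a tool by name and run it."
   ]
  },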
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "ae3882c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.tools.retriever import create_retriever_tool\n",
    "\n",
    "retriever_tool = create_retriever_tool(\n",
    "    retriever,\n",
    "    \"Sora\",\n",
    "    \"Search for information about Sora. For any questions about Sora, you must use this tool!\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "e5864b25",
   "metadata": {},
   "outputs": [],
   "source": [
    "tools = [retriever_tool]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "990d0406",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "from langchain import hub\n",
    "from langchain.agents import create_openai_functions_agent\n",
    "from langchain.agents import AgentExecutor\n",
    "\n",
    "\n",
    "prompt = hub.pull(\"hwchase17/openai-functions-agent\")\n",
    "llm = ChatOpenAI(model=\"gpt-4\", temperature=0)\n",
    "agent = create_openai_functions_agent(llm, tools, prompt)\n",
    "agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "13faa384",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
      "\u001b[32;1m\u001b[1;3m\n",
      "Invoking: `Sora` with `{'query': 'What is the Sora model?'}`\n",
      "\n",
      "\n",
      "\u001b[0m\u001b[36;1m\u001b[1;3mconsistently through three-dimensional space.Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.Simulating digital worlds. Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.DiscussionSora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. 
We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page.We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.AuthorsTim BrooksBill PeeblesConnor HolmesWill DePueYufei GuoLi JingDavid SchnurrJoe TaylorTroy LuhmanEric LuhmanClarence Wing Yin NgRicky WangAditya RameshAcknowledgmentsCitationPlease cite as Brooks, Peebles, et al., and use the following BibTeX for citation: https://openai.com/bibtex/videoworldsimulators2024.bibResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTCompanyAboutBlogCareersCharterSecurityCustomer storiesSafetyOpenAI © 2015 – 2024Terms & policiesPrivacy policyBrand guidelinesSocialTwitterYouTubeGitHubSoundCloudLinkedInBack to top\n",
      "\n",
      "CloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.February 15, 2024More resourcesView Sora overviewVideo generation, Sora, Milestone, ReleaseThis technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,[^1][^2][^3] generative adversarial networks,[^4][^5][^6][^7] autoregressive transformers,[^8][^9] and diffusion models.[^10][^11][^12] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. 
Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.Turning visual data into patchesWe take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.[^13][^14] The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.[^15][^16][^17][^18] We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,[^19] and subsequently decomposing the representation into spacetime patches.Video compression networkWe train a network that reduces the dimensionality of visual data.[^20] This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.Spacetime latent patchesGiven a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. 
At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.Scaling transformers for video generationSora is a diffusion model[^21][^22][^23][^24][^25]; given input noisy patches (and\n",
      "\n",
      "by arranging randomly-initialized patches in an appropriately-sized grid.Scaling transformers for video generationSora is a diffusion model[^21][^22][^23][^24][^25]; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer.[^26] Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,[^13][^14] computer vision,[^15][^16][^17][^18] and image generation.[^27][^28][^29]In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.Base compute4x compute32x computeVariable durations, resolutions, aspect ratiosPast approaches to image and video generation typically resize, crop or trim videos to a standard size—e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.Sampling flexibilitySora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything inbetween. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.Improved framing and compositionWe empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. 
In comparison, videos from Sora (right) have improved framing.Language understandingTraining text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3[^30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.{\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "}Prompting with images and videosAll of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.Animating DALL·E imagesSora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2[^31] and DALL·E 3[^30] images.A Shiba Inu dog wearing a beret and black turtleneck.Monster Illustration in flat design style of a diverse family of monsters. The group includes a furry brown monster, a sleek black monster with antennas, a spotted green monster, and a tiny polka-dotted monster, all interacting in a playful environment.An image of a realistic cloud that spells “SORA”.In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two surfers, seizing the moment, skillfully navigate the face of the wave.Extending generated videosSora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.00:0000:20We can use this method to extend a video both forward and backward to produce a seamless infinite loop.Video-to-video editingDiffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit,[^32] to Sora. This technique enables Sora to transform  the styles and environments of input videos zero-shot.Input videochange the setting to be in a lush junglechange the setting to the 1920s with an old school car. 
make sure to keep the red colormake it go underwaterchange the video setting to be different than a mountain? perhaps joshua tree?put the video in space with a rainbow roadkeep the video the same but make it be wintermake it in claymation animation stylerecreate in the style of a charcoal drawing, making sure to be black and whitechange the setting to be cyberpunkchange the video to a medieval thememake it have dinosaursrewrite the video in a pixel art styleConnecting videosWe can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.Image generation capabilitiesSora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.Close-up portrait shot of a woman in autumn, extreme detail, shallow depth of fieldVibrant coral reef teeming with colorful fish and sea creaturesDigital art of a young tiger under an apple tree in a matte painting style with gorgeous detailsA snowy mountain village with cozy cabins and a northern lights display, high detail and photorealistic dslr, 50mm f/1.2Emerging simulation capabilitiesWe find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.Long-range coherence and object permanence. 
A significant challenge for video generation systems has been maintaining temporal consistency when sampling\u001b[0m\u001b[32;1m\u001b[1;3mThe Sora model is a large-scale generative model trained on video data. It's a text-conditional diffusion model that operates on spacetime patches of video and image latent codes. Sora can generate videos and images of variable durations, resolutions, and aspect ratios, up to a full minute of high-definition video.\n",
      "\n",
      "Key features of the Sora model include:\n",
      "\n",
      "1. **3D Consistency**: Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.\n",
      "\n",
      "2. **Long-range Coherence and Object Permanence**: Sora can effectively model both short- and long-range dependencies. For example, it can persist people, animals, and objects even when they are occluded or leave the frame.\n",
      "\n",
      "3. **Interacting with the World**: Sora can simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.\n",
      "\n",
      "4. **Simulating Digital Worlds**: Sora is also able to simulate artificial processes, such as video games. It can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity.\n",
      "\n",
      "However, Sora also has limitations. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. Despite these limitations, the capabilities Sora has today suggest that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals, and people that live within them.\u001b[0m\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'input': 'What is the Sora model?',\n",
       " 'output': \"The Sora model is a large-scale generative model trained on video data. It's a text-conditional diffusion model that operates on spacetime patches of video and image latent codes. Sora can generate videos and images of variable durations, resolutions, and aspect ratios, up to a full minute of high-definition video.\\n\\nKey features of the Sora model include:\\n\\n1. **3D Consistency**: Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.\\n\\n2. **Long-range Coherence and Object Permanence**: Sora can effectively model both short- and long-range dependencies. For example, it can persist people, animals, and objects even when they are occluded or leave the frame.\\n\\n3. **Interacting with the World**: Sora can simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.\\n\\n4. **Simulating Digital Worlds**: Sora is also able to simulate artificial processes, such as video games. It can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity.\\n\\nHowever, Sora also has limitations. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. Despite these limitations, the capabilities Sora has today suggest that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals, and people that live within them.\"}"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "agent_executor.invoke({\"input\": \"What is the Sora model?\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "522a101c",
   "metadata": {},
   "source": [
    "### 3. PromptTemplate and ChatPromptTemplate"
   ]
  },
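  {
   "cell_type": "markdown",
   "id": "e365c81e",
   "metadata": {},
   "source": [
    "```Question```: What is the difference between the two? In short, `PromptTemplate` formats a single string for a base (text-completion) LLM, while `ChatPromptTemplate` formats a list of role-tagged messages for a chat model."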
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d657502",
   "metadata": {},
   "source": [
    "```A. for LLM (base)```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "id": "626955cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'编写一段关于美国留学的小红书宣传文案，需要采用幽默语气'"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "prompt_template = PromptTemplate.from_template(\n",
    "    \"编写一段关于{主题}的小红书宣传文案，需要采用{风格}语气\"\n",
    ")\n",
    "prompt_template.format(主题=\"美国留学\", 风格=\"幽默\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "2502f2c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'编写一段关于美国留学的小红书宣传文案'"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "prompt_template = PromptTemplate.from_template(\"编写一段关于美国留学的小红书宣传文案\")\n",
    "prompt_template.format()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe7da578",
   "metadata": {},
   "source": [
    "```B. for Chat Model```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "id": "7ebea7cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "chat_template = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\", \"你是AI助教，你的名字是{name}.\"),\n",
    "        (\"human\", \"你好\"),\n",
    "        (\"ai\", \"你好，有什么可以帮到您？\"),\n",
    "        (\"human\", \"{user_input}\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "messages = chat_template.format_messages(name=\"张三\", user_input=\"你的名字是什么？\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "86fff242",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='你好，我的名字是张三，我是你的AI助教。有什么可以帮助你的吗？')"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "llm.invoke(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "id": "088c0ed4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='我的名字是张三。有什么问题我可以帮您解答呢？')"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = chat_template | llm\n",
    "chain.invoke({\"name\":\"张三\", \"user_input\":\"你的名字是什么？\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "id": "a8c401aa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "我的名字是张三。有什么问题我可以帮您解答？"
     ]
    }
   ],
   "source": [
    "# query --> LLM --> response\n",
    "# With streaming, the response arrives chunk by chunk: w w w w w w w ...\n",
    "for chunk in llm.stream(messages):\n",
    "    print(chunk.content, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b5fede8",
   "metadata": {},
   "source": [
    "```C. Few shot prompt templates for given examples```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "af70bb73",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import (\n",
    "    ChatPromptTemplate,\n",
    "    FewShotChatMessagePromptTemplate,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "877f79df",
   "metadata": {},
   "outputs": [],
   "source": [
    "examples = [\n",
    "    {\"input\": \"2+2\", \"output\": \"4\"},\n",
    "    {\"input\": \"2+3\", \"output\": \"5\"},\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "id": "0e796aef",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Human: 2+2\n",
      "AI: 4\n",
      "Human: 2+3\n",
      "AI: 5\n"
     ]
    }
   ],
   "source": [
    "# This is a prompt template used to format each individual example.\n",
    "example_prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"human\", \"{input}\"),\n",
    "        (\"ai\", \"{output}\"),\n",
    "    ]\n",
    ")\n",
    "few_shot_prompt = FewShotChatMessagePromptTemplate(\n",
    "    example_prompt=example_prompt,\n",
    "    examples=examples,\n",
    ")\n",
    "\n",
    "print(few_shot_prompt.format())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "id": "699407b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "final_prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\", \"You are a wondrous wizard of math.\"), # instructions\n",
    "        few_shot_prompt,  # few shot examples \n",
    "        (\"human\", \"{input}\"),  # input\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "id": "56dbbdd9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content='10')"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain = final_prompt | llm\n",
    "\n",
    "chain.invoke({\"input\": \"4+6\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e3a52b9",
   "metadata": {},
   "source": [
    "```D. Few shot prompt template for dynamic examples```\n",
    "\n",
    "Question: why select examples dynamically?  \n",
    "Please consult the LangChain documentation for details."
   ]
  },
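  {
   "cell_type": "markdown",
   "id": "d3f0a1b2",
   "metadata": {},
   "source": [
    "One reason to select examples dynamically: when the example pool is large, sending every example wastes tokens, and unrelated examples can distract the model. The sketch below picks the `k` examples closest to the incoming question. It is a standard-library illustration only; `overlap_score` and `select_examples` are hypothetical helpers, and the character-overlap score stands in for the embedding similarity that a selector such as LangChain's `SemanticSimilarityExampleSelector` relies on.\n",
    "\n",
    "```python\n",
    "# Sketch only: `overlap_score` and `select_examples` are hypothetical helpers,\n",
    "# not LangChain APIs.\n",
    "\n",
    "def overlap_score(a, b):\n",
    "    # Jaccard similarity over the sets of non-space characters.\n",
    "    set_a = set(a.replace(' ', ''))\n",
    "    set_b = set(b.replace(' ', ''))\n",
    "    if not set_a or not set_b:\n",
    "        return 0.0\n",
    "    return len(set_a & set_b) / len(set_a | set_b)\n",
    "\n",
    "\n",
    "def select_examples(examples, query, k=2):\n",
    "    # Return the k examples whose 'input' is most similar to the query.\n",
    "    ranked = sorted(\n",
    "        examples,\n",
    "        key=lambda ex: overlap_score(ex['input'], query),\n",
    "        reverse=True,\n",
    "    )\n",
    "    return ranked[:k]\n",
    "\n",
    "\n",
    "examples = [\n",
    "    {'input': '2+2', 'output': '4'},\n",
    "    {'input': '2+3', 'output': '5'},\n",
    "    {'input': 'capital of France', 'output': 'Paris'},\n",
    "    {'input': 'capital of Japan', 'output': 'Tokyo'},\n",
    "]\n",
    "\n",
    "selected = select_examples(examples, 'capital of Italy', k=2)\n",
    "for ex in selected:\n",
    "    print('Human:', ex['input'])\n",
    "    print('AI:', ex['output'])\n",
    "```\n",
    "\n",
    "The selected examples can then be formatted into the prompt in place of a fixed example list, so a geography question is paired with geography examples rather than arithmetic ones."
   ]
  },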
  {
   "cell_type": "markdown",
   "id": "8028c9c8",
   "metadata": {},
   "source": [
    "```E. Cache```\n",
    "\n",
    "When a question has been asked before, its answer is returned directly from the cache, which reduces cost and latency."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "id": "ca013577",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 19.9 ms, sys: 4.42 ms, total: 24.3 ms\n",
      "Wall time: 3.19 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "AIMessage(content='为什么冰箱总是笑得开心？\\n\\n因为它有很多冷笑话！')"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "from langchain.cache import InMemoryCache\n",
    "from langchain.globals import set_llm_cache\n",
    "\n",
    "set_llm_cache(InMemoryCache())\n",
    "\n",
    "# First call: the request goes to the model, so it takes time\n",
    "llm.invoke(\"讲一个冷笑话\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "id": "3b8cb0aa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.71 ms, sys: 174 µs, total: 1.88 ms\n",
      "Wall time: 2.64 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "AIMessage(content='为什么冰箱总是笑得开心？\\n\\n因为它有很多冷笑话！')"
      ]
     },
     "execution_count": 109,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "# Second call: the answer is served directly from the cache\n",
    "llm.invoke(\"讲一个冷笑话\")"
   ]
  },
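  {
   "cell_type": "markdown",
   "id": "e6a2c9d4",
   "metadata": {},
   "source": [
    "The timing difference between the two calls above can be reproduced with a minimal standard-library sketch of prompt-level caching. `slow_llm` is a stand-in for a real model call and `CachedLLM` is a hypothetical helper; this is not how `InMemoryCache` is implemented, only an illustration of the idea.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "# Sketch only: `slow_llm` and `CachedLLM` are hypothetical, not LangChain APIs.\n",
    "\n",
    "def slow_llm(prompt):\n",
    "    time.sleep(0.2)  # simulate network and generation latency\n",
    "    return 'echo: ' + prompt\n",
    "\n",
    "\n",
    "class CachedLLM:\n",
    "    def __init__(self, llm_fn):\n",
    "        self.llm_fn = llm_fn\n",
    "        self.cache = {}  # prompt -> response\n",
    "        self.calls = 0   # number of real (uncached) calls\n",
    "\n",
    "    def invoke(self, prompt):\n",
    "        # Only call the underlying model on a cache miss.\n",
    "        if prompt not in self.cache:\n",
    "            self.calls += 1\n",
    "            self.cache[prompt] = self.llm_fn(prompt)\n",
    "        return self.cache[prompt]\n",
    "\n",
    "\n",
    "cached = CachedLLM(slow_llm)\n",
    "for attempt in (1, 2):\n",
    "    start = time.perf_counter()\n",
    "    answer = cached.invoke('Tell me a joke')\n",
    "    elapsed = time.perf_counter() - start\n",
    "    print(f'call {attempt}: {elapsed:.3f}s -> {answer}')\n",
    "```\n",
    "\n",
    "Note that this kind of cache matches prompts exactly: a reworded question is a cache miss and triggers a new model call."
   ]
  },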
  {
   "cell_type": "markdown",
   "id": "1b7f4e31",
   "metadata": {},
   "source": [
    "```F. Tracking token usage```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "63515eac",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.callbacks import get_openai_callback\n",
    "from langchain_openai import ChatOpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "id": "94685dab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tokens Used: 57\n",
      "\tPrompt Tokens: 16\n",
      "\tCompletion Tokens: 41\n",
      "Successful Requests: 1\n",
      "Total Cost (USD): $0.00294\n"
     ]
    }
   ],
   "source": [
    "llm = ChatOpenAI(model_name=\"gpt-4\")\n",
    "\n",
    "with get_openai_callback() as cb:\n",
    "    result = llm.invoke(\"最近心情怎么样？\")\n",
    "    print(cb)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "id": "a1013e71",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tokens Used: 89\n",
      "\tPrompt Tokens: 22\n",
      "\tCompletion Tokens: 67\n",
      "Successful Requests: 2\n",
      "Total Cost (USD): $0.00468\n"
     ]
    }
   ],
   "source": [
    "with get_openai_callback() as cb:\n",
    "    result = llm.invoke(\"Tell me three jokes\")\n",
    "    result2 = llm.invoke(\"Tell me a joke\")\n",
    "    print(cb)"
   ]
  },
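  {
   "cell_type": "markdown",
   "id": "f7b3d0e5",
   "metadata": {},
   "source": [
    "To make the reported cost less opaque, here is a minimal sketch of how token counts could be accumulated and priced inside a `with` block. `UsageTracker` and `track_usage` are hypothetical names, not the `get_openai_callback` implementation. The per-1K-token prices ($0.03 prompt, $0.06 completion) are assumptions, chosen because they reproduce the $0.00294 printed above for 16 prompt tokens and 41 completion tokens.\n",
    "\n",
    "```python\n",
    "from contextlib import contextmanager\n",
    "\n",
    "# Sketch only: `UsageTracker` and `track_usage` are hypothetical helpers,\n",
    "# and the default prices are assumptions.\n",
    "\n",
    "class UsageTracker:\n",
    "    # Accumulates token counts and cost across several requests.\n",
    "    def __init__(self, prompt_price_per_1k, completion_price_per_1k):\n",
    "        self.prompt_price_per_1k = prompt_price_per_1k\n",
    "        self.completion_price_per_1k = completion_price_per_1k\n",
    "        self.prompt_tokens = 0\n",
    "        self.completion_tokens = 0\n",
    "        self.requests = 0\n",
    "\n",
    "    def record(self, prompt_tokens, completion_tokens):\n",
    "        self.prompt_tokens += prompt_tokens\n",
    "        self.completion_tokens += completion_tokens\n",
    "        self.requests += 1\n",
    "\n",
    "    @property\n",
    "    def total_tokens(self):\n",
    "        return self.prompt_tokens + self.completion_tokens\n",
    "\n",
    "    @property\n",
    "    def total_cost(self):\n",
    "        weighted = (self.prompt_tokens * self.prompt_price_per_1k\n",
    "                    + self.completion_tokens * self.completion_price_per_1k)\n",
    "        return weighted / 1000\n",
    "\n",
    "\n",
    "@contextmanager\n",
    "def track_usage(prompt_price_per_1k=0.03, completion_price_per_1k=0.06):\n",
    "    # Everything recorded inside the with-block lands on one tracker.\n",
    "    tracker = UsageTracker(prompt_price_per_1k, completion_price_per_1k)\n",
    "    yield tracker\n",
    "\n",
    "\n",
    "with track_usage() as usage:\n",
    "    usage.record(prompt_tokens=16, completion_tokens=41)\n",
    "    print('Tokens Used:', usage.total_tokens)\n",
    "    print('Successful Requests:', usage.requests)\n",
    "    print('Total Cost (USD):', round(usage.total_cost, 5))\n",
    "```\n",
    "\n",
    "Because completion tokens are priced higher than prompt tokens under these assumed rates, long answers dominate the bill even when the prompt is short."
   ]
  },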
  {
   "cell_type": "markdown",
   "id": "2bf53aa7",
   "metadata": {},
   "source": [
    "```G. Output Parsing```"
   ]
  },
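  {
   "cell_type": "markdown",
   "id": "a9c4e1f6",
   "metadata": {},
   "source": [
    "An output parser turns the model's free-form text into structured data that downstream code can use. LangChain ships ready-made parsers (for example `StrOutputParser` and `CommaSeparatedListOutputParser`); the sketch below only illustrates the idea with the standard library. `parse_comma_list` and `parse_json_block` are hypothetical helpers, and `reply` is a made-up model response.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "# Sketch only: these helpers are hypothetical, not LangChain classes.\n",
    "\n",
    "def parse_comma_list(text):\n",
    "    # Split 'a, b, c' into ['a', 'b', 'c'], dropping empty items.\n",
    "    return [item.strip() for item in text.split(',') if item.strip()]\n",
    "\n",
    "\n",
    "def parse_json_block(text):\n",
    "    # Parse the span from the first '{' to the last '}' as JSON.\n",
    "    start, end = text.find('{'), text.rfind('}')\n",
    "    if start == -1 or end < start:\n",
    "        raise ValueError('no JSON object found in model output')\n",
    "    return json.loads(text[start:end + 1])\n",
    "\n",
    "\n",
    "# A made-up reply that wraps a JSON object in conversational text.\n",
    "reply = 'Sure, here it is: ' + json.dumps({'name': 'Sora', 'kind': 'video model'})\n",
    "\n",
    "print(parse_comma_list('red, green , blue,'))\n",
    "print(parse_json_block(reply))\n",
    "```\n",
    "\n",
    "In a chain, a parser like this would sit after the model, so that the chain returns a list or dict instead of a raw message."
   ]
  },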
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80ddfabe",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
